Abstract. We study how much memory one-pass compression algorithms need to compete with the
best multi-pass algorithms. We call a one-pass algorithm an $f(n, \ell)$-footprint compressor if,
given $n$, $\ell$ and an $n$-ary string $S$, it stores $S$ in
$(O(H_\ell(S)) + o(\log n))\,|S| + O(n^{\ell+1} \log n)$ bits, where $H_\ell(S)$ is the
$\ell$th-order empirical entropy of $S$, while using at most $f(n, \ell)$ bits of memory. We prove
that, for any $\epsilon > 0$ and some $f(n, \ell) \in O(n^{\ell+\epsilon} \log n)$, there is an
$f(n, \ell)$-footprint compressor; on the other hand, there is no $f(n, \ell)$-footprint compressor
for $f(n, \ell) \in o(n^\ell \log n)$.



1 Introduction 



One-pass compression has been studied extensively and many of the best compression algorithms
(e.g., LZ77, LZ78 and PPM) are one-pass algorithms. It is useful when compressing files too large
to fit in internal memory, since each page need be brought from external memory only once; it is
crucial when compressing data that are time-sensitive and must be transmitted as they are
generated, or that come from data streams (see, e.g., [1]), i.e., sources that produce so many
data they cannot all be stored and must be manipulated as they pass. Even one-pass algorithms
have a disadvantage, though: most build and store models and, in general, the better the
compression, the bigger the model. There are several techniques for limiting one-pass algorithms'
memory consumption, such as using a sliding window over the data, but it is not always clear how
these affect compression. In this paper we prove worst-case bounds on how much memory one-pass
algorithms need to compete with the best multi-pass algorithms.

The first notable work on one-pass compression was done by Faller [8] in 1973 and Gallager [9]
in 1978, toward a dynamic version of Huffman's algorithm [10]. In 1985 Knuth [12] used their
results in an algorithm that takes time linear in the size of its output. It was the basis of the
Unix utility compact and is sometimes known as the FGK algorithm, after the authors' initials, to
distinguish it from an improved version of dynamic Huffman coding that Vitter [18] published in
1987. In the meantime, Ziv and Lempel had invented their well-known algorithms LZ77 [19] and
LZ78 [20], which were the bases of, e.g., the Unix utility compress, gzip, WinZip® and, more
recently, 7-Zip. Dynamic arithmetic coding was developed in the 1980s (see, e.g., [15]) and,
generalizing it to use context, so was PPM (see, e.g., [5]); although PPM is somewhat slower and
less space efficient than other popular algorithms, it is generally considered to achieve the best
compression. Finally, Bentley, Sleator, Tarjan and Wei [2] and Elias [7] independently invented
move-to-front compression, which is especially interesting for us because its space complexity
does not depend on its input's length; we use this property in Section 3 when proving our upper
bound.

Algorithms based on the Burrows-Wheeler Transform [3] represent the state of the art in
multi-pass compression (e.g., bzip2). Kaplan, Landau and Verbin [11] recently proved one such
algorithm, given an $n$-ary string $S$, stores $S$ in
$(c H_\ell(S) + \log \zeta(c) + r)\,|S| + c\,n^{\ell+1} \log n$ bits, for any $c > 1$ and
$\ell \geq 0$. By $H_\ell(S)$ we mean the $\ell$th-order empirical entropy of $S$; $\log$ means
$\log_2$; $\zeta$ is the Riemann zeta function; and $r$ is the redundancy of a simpler compression
algorithm (e.g., arithmetic coding or Huffman coding) chosen as a subroutine. Their analysis was
based on an earlier one by Manzini [14], which established empirical entropy as a popular
complexity metric. Since even Manzini's analysis is relatively recent, we discuss empirical
entropy in Section 2. To compare one-pass and multi-pass algorithms theoretically, we use Kaplan,
Landau and Verbin's analysis as a benchmark. We call a one-pass algorithm an
$f(n, \ell)$-footprint compressor if, given $n$, $\ell$ and an $n$-ary string $S$, it stores $S$ in
$(O(H_\ell(S)) + o(\log n))\,|S| + O(n^{\ell+1} \log n)$ bits while using at most $f(n, \ell)$ bits
of memory. In Section 3 we prove nearly tight bounds on footprints: for any $\epsilon > 0$ and some
$f(n, \ell) \in O(n^{\ell+\epsilon} \log n)$, there is an $f(n, \ell)$-footprint compressor; for
$f(n, \ell) \in o(n^\ell \log n)$, however, there is no $f(n, \ell)$-footprint compressor.



2 Empirical entropy 

Markov processes have long been the most popular models for many kinds of data; e.g.,
Shannon [17] fitted zeroth-, first- and second-order Markov processes to English, gave samples
of their output and wrote "the resemblance to ordinary English text increases quite noticeably at
each of the above steps", and "a sufficiently complex stochastic process will give a satisfactory
representation" of natural written language. Of course, there is an important difference between
representation and equivalence: Chomsky [4] argued a probabilistic model cannot determine whether
a novel sentence is grammatical (e.g., "Colorless green ideas sleep furiously.") or not (e.g.,
"Furiously sleep ideas green colorless.") and concluded "the notion 'grammatical in English'
cannot be identified in any way with the notion 'high order of statistical approximation to
English'"; in other words, people are not Markov processes.

The self-information of a string $S$ with respect to a source $M$, i.e., the negative logarithm
of the probability $M$ generates $S$, measures how many bits we need to store $S$ when using $M$
as a model. The $\ell$th-order empirical entropy $H_\ell(S)$ of $S$ is $1/|S|$ times the minimum
self-information of $S$ with respect to an $\ell$th-order Markov process. Thus, comparing
$H_\ell(S)$ to the logarithm of the alphabet size tells us how much we can compress $S$ when using
an $\ell$th-order Markov process as a model, regardless of how $S$ is generated. As Manzini [14]
wrote, "the empirical entropy resembles the entropy defined in the probabilistic setting (for
example, when the input comes from a Markov source) [but] is defined for any string and can be
used to measure the performance of compression algorithms without any assumption on the input."

Another way to view the $\ell$th-order empirical entropy of $S$ is as our expected uncertainty
about a randomly chosen character, given a context of length $\ell$. Let $s_i$ denote the $i$th
character of $S$ and consider the following experiment: $i$ is chosen uniformly at random from
$\{1, \ldots, |S|\}$; if $i \leq \ell$, then we are told $s_i$; otherwise, we are told
$s_{i-\ell} \cdots s_{i-1}$. Our expected uncertainty about the random variable $s_i$ (its
expected entropy) is



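\[
H_\ell(S) =
\begin{cases}
\displaystyle \sum_{a \in S} \frac{\#_a(S)}{|S|} \log \frac{|S|}{\#_a(S)} & \text{if } \ell = 0, \\[2ex]
\displaystyle \frac{1}{|S|} \sum_{|\alpha| = \ell} |S_\alpha| \, H_0(S_\alpha) & \text{if } \ell \geq 1.
\end{cases}
\]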



Here, $a \in S$ means character $a$ occurs in $S$; $\#_a(S)$ is the number of occurrences of $a$
in $S$; and $S_\alpha$ is the string obtained by concatenating the characters immediately
following occurrences of string $\alpha$ in $S$. The length of $S_\alpha$ is the number of
occurrences of $\alpha$ in $S$ unless $\alpha$ is a suffix of $S$, in which case it is 1 less.
Notice that, if $n$ is the alphabet's size, then $H_{\ell+1}(S) \leq H_\ell(S) \leq \log n$ for
all $\ell$. For example, if $S$ is the string TORONTO, then

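\[
H_0(S) = \frac{1}{7} \log 7 + \frac{3}{7} \log \frac{7}{3} + \frac{1}{7} \log 7
       + \frac{2}{7} \log \frac{7}{2} \approx 1.84 ,
\]

\[
\begin{aligned}
H_1(S) &= \frac{1}{7} \bigl( H_0(S_{\mathrm{N}}) + 2 H_0(S_{\mathrm{O}}) + H_0(S_{\mathrm{R}}) + 2 H_0(S_{\mathrm{T}}) \bigr) \\
       &= \frac{1}{7} \bigl( H_0(\mathrm{T}) + 2 H_0(\mathrm{RN}) + H_0(\mathrm{O}) + 2 H_0(\mathrm{OO}) \bigr) \\
       &= 2/7 \approx 0.29
\end{aligned}
\]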

and all higher-order empirical entropies of S are 0. This means if someone chooses a character 
uniformly at random from TORONTO and asks us to guess it, then our uncertainty is 
about 1.84 bits. If they tell us the preceding character before we guess, then on average 
our uncertainty is about 0.29 bits; if they tell us the preceding two characters, then we are 
certain of the answer. 
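
The TORONTO figures can be reproduced directly from the definition of empirical entropy. The following is a minimal sketch rather than code from the paper; the function names `h0` and `empirical_entropy` are chosen here for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per character."""
    if not s:
        return 0.0
    counts = Counter(s)
    return sum(c / len(s) * log2(len(s) / c) for c in counts.values())


def empirical_entropy(s, order):
    """order-th empirical entropy: (1/|S|) * sum over contexts of |S_a| * H0(S_a)."""
    if order == 0:
        return h0(s)
    followers = defaultdict(list)  # context -> characters that follow it in s
    for i in range(order, len(s)):
        followers[s[i - order:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


for order in range(3):
    print(order, round(empirical_entropy("TORONTO", order), 2))
```

The loop prints the rounded zeroth-, first- and second-order values for TORONTO. A context that ends the string contributes one follower fewer than its number of occurrences, matching the definition of the follower strings above.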



3 Upper and lower bounds on footprints 

Our upper bound is based on a generalization of move-to-front compression (MTF). To compress
an $n$-ary string $S$, MTF starts with a list storing the numbers from $0$ to $n - 1$; for each
character $s_i$ in $S$, it prints $s_i$'s position in the list, encoded with Elias' delta code [6],
then moves $s_i$ to the front of the list. Despite its simplicity, MTF is quite efficient: it
stores $S$ in $(H_0(S) + 2 \log(H_0(S) + 1) + 1)\,|S| + O(n \log n)$ bits while using
$O(n \log n)$ bits of memory.
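
The MTF procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: characters are integers in `range(n)`, list positions are counted from 1 because the delta code is defined for positive integers, and the output is a Python string of '0' and '1' characters rather than packed bits.

```python
def elias_delta(x):
    """Elias delta codeword for an integer x >= 1, as a string of bits."""
    assert x >= 1
    n_bits = x.bit_length()          # floor(log2 x) + 1
    len_bits = n_bits.bit_length()   # number of bits needed to write n_bits
    # Gamma code of n_bits (zeros, then n_bits in binary),
    # followed by x in binary without its leading 1.
    return "0" * (len_bits - 1) + bin(n_bits)[2:] + bin(x)[3:]


def mtf_encode(s, n):
    """Move-to-front encode a sequence of integers drawn from range(n)."""
    lst = list(range(n))             # initial list 0, 1, ..., n - 1
    out = []
    for ch in s:
        pos = lst.index(ch)          # 0-based position in the list
        out.append(elias_delta(pos + 1))
        lst.insert(0, lst.pop(pos))  # move the character to the front
    return "".join(out)


# TORONTO with the (assumed) mapping N=0, O=1, R=2, T=3
print(mtf_encode([3, 1, 2, 1, 0, 3, 1], 4))
```

The list is the only state the encoder keeps, which is why the memory use depends on the alphabet size and not on the length of the input.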

Theorem 1. For any $\epsilon > 0$ and some $f(n, \ell) \in O(n^{\ell+\epsilon} \log n)$, there is
an $f(n, \ell)$-footprint compressor.

Proof. Consider the algorithm $A_{n,\ell}$ that, given an $n$-ary string $S$, keeps a list
$Q_\alpha$ of maximum size $\lfloor n^\epsilon \rfloor$ for each possible $\ell$-tuple $\alpha$.
For $1 \leq i \leq \ell$, $A_{n,\ell}$ prints the $\lceil \log n \rceil$-bit binary representation
of the $i$th character $s_i$ of $S$. For $i \geq \ell + 1$, if $s_i$ is stored in
$Q_{s_{i-\ell}, \ldots, s_{i-1}}$, then $A_{n,\ell}$ prints 1 followed by $s_i$'s position in
$Q_{s_{i-\ell}, \ldots, s_{i-1}}$, encoded with the delta code, and moves $s_i$ to the front;
otherwise, $A_{n,\ell}$ prints 0 followed by the $\lceil \log n \rceil$-bit binary representation
of $s_i$, then inserts $s_i$ at the front and, if necessary, deletes the last character in
$Q_{s_{i-\ell}, \ldots, s_{i-1}}$.
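
The encoder just described can be sketched in a few lines. This is a hedged illustration, not the paper's code: characters are assumed to be integers in `range(n)`, the lists are created lazily in a dictionary instead of being allocated for all $\ell$-tuples up front, the list capacity and the raw width are clamped to at least 1 (an assumption made here so that tiny parameters still work), and the output is a string of '0' and '1' characters.

```python
from math import ceil, floor, log2


def elias_delta(x):
    """Elias delta codeword for an integer x >= 1, as a string of bits."""
    n_bits = x.bit_length()
    len_bits = n_bits.bit_length()
    return "0" * (len_bits - 1) + bin(n_bits)[2:] + bin(x)[3:]


def encode(s, n, ell, eps):
    """One bounded move-to-front list per ell-tuple context."""
    width = max(1, ceil(log2(n)))   # ceil(log n) bits per raw character
    cap = max(1, floor(n ** eps))   # maximum list size floor(n^eps)
    lists = {}                      # context tuple -> recency list
    out = []
    for i, ch in enumerate(s):
        raw = format(ch, "0{}b".format(width))
        if i < ell:                 # first ell characters: plain binary
            out.append(raw)
            continue
        q = lists.setdefault(tuple(s[i - ell:i]), [])
        if ch in q:                 # hit: flag 1, then delta-coded position
            pos = q.index(ch)
            out.append("1" + elias_delta(pos + 1))
            q.pop(pos)
        else:                       # miss: flag 0, then the raw character
            out.append("0" + raw)
            if len(q) == cap:
                q.pop()             # drop the last (least recently used) entry
        q.insert(0, ch)             # the character ends up at the front
    return "".join(out)


print(encode([0, 1, 0, 1, 0, 1], n=2, ell=1, eps=1))
```

Evicting before inserting, as done here, leaves each list in the same state as inserting first and then deleting the last character.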

It takes $O(n^{\ell+\epsilon} \log n)$ bits to store all lists. After $A_{n,\ell}$ has printed
$s_1, \ldots, s_\ell$ in binary, it becomes equivalent to $n^\ell$ copies of $A_{n,0}$: for each
possible $\ell$-tuple $\alpha$, a copy of $A_{n,0}$ operates on the string $S_\alpha$ obtained by
concatenating the characters immediately following occurrences of $\alpha$ in $S$. Thus, by the
definition of empirical entropy, if $A_{n,0}$ prints
$(O(H_0(S_\alpha)) + o(\log n))\,|S_\alpha| + O(n \log n)$ bits when given any $S_\alpha$, then
$A_{n,\ell}$ prints

\[
\sum_{|\alpha| = \ell} \bigl( O(H_0(S_\alpha)) + o(\log n) \bigr) |S_\alpha| + O(n^{\ell+1} \log n)
= \bigl( O(H_\ell(S)) + o(\log n) \bigr) |S| + O(n^{\ell+1} \log n)
\]

bits while using $O(n^{\ell+\epsilon} \log n)$ bits of memory; i.e., $A_{n,\ell}$ is an
$f(n, \ell)$-footprint compressor for some $f(n, \ell) \in O(n^{\ell+\epsilon} \log n)$.

If $\epsilon \geq 1$, then $A_{n,0}$ behaves like MTF, except that it prepends a 1 to each
codeword. Suppose $\epsilon < 1$, $A_{n,0}$ is given $S_\alpha$ and that, when processing some
character $s_i$, MTF would print the encoding of $x$ in the delta code. If
$x \leq \lfloor n^\epsilon \rfloor$, then $A_{n,0}$ prints a 1 followed by the same codeword; if
$x > \lfloor n^\epsilon \rfloor$, in which case the codeword for $x$ in the delta code is at least
$\epsilon \log n$ bits, then $A_{n,0}$ prints a 0 followed by a $\lceil \log n \rceil$-bit number.
In either case, $A_{n,0}$ prints at most $1/\epsilon$ times the number of bits MTF would print,
plus 2. Thus, $A_{n,0}$ stores $S_\alpha$ in
$\frac{1}{\epsilon} \bigl( H_0(S_\alpha) + 2 \log(H_0(S_\alpha) + 1) + 3 \bigr) |S_\alpha| + O(n \log n)
\subseteq \bigl( O(H_0(S_\alpha)) + O(\log \log n) \bigr) |S_\alpha| + O(n \log n)$ bits. □
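
The per-character accounting in the last paragraph of the proof can be written out explicitly. The lines below are a sketch under two assumptions: $\epsilon$ is a constant with $0 < \epsilon < 1$, and $|\delta(x)|$ (notation introduced only for this note) denotes the length of the standard delta codeword for $x \geq 1$.

```latex
\begin{align*}
% Length of the delta codeword for x >= 1
|\delta(x)| &= \lfloor \log x \rfloor + 2 \lfloor \log(\lfloor \log x \rfloor + 1) \rfloor + 1
             \;\geq\; \lfloor \log x \rfloor + 1 \;>\; \log x , \\
% Case x <= floor(n^eps): flag bit plus the same codeword
x \leq \lfloor n^\epsilon \rfloor :\quad
  1 + |\delta(x)| &\leq \tfrac{1}{\epsilon} |\delta(x)| + 2 , \\
% Case x > floor(n^eps): x is an integer, so x > n^eps and |delta(x)| > eps log n
x > \lfloor n^\epsilon \rfloor :\quad
  1 + \lceil \log n \rceil &< \log n + 2 \;<\; \tfrac{1}{\epsilon} |\delta(x)| + 2 , \\
% H_0(S_alpha) <= log n bounds the middle term
2 \log\bigl(H_0(S_\alpha) + 1\bigr) &\leq 2 \log(\log n + 1) \in O(\log \log n) .
\end{align*}
```

Adding the 2 extra bits per character to MTF's bound and using $1/\epsilon > 1$ gives the constant 3 inside the parentheses of the final expression.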

Our lower bound is based on arguments about objects' Kolmogorov complexities [13]. The
Kolmogorov complexity $K(X)$ of an object $X$ is the minimum space needed to store $X$ or, more
formally, the length in bits of the shortest program that returns $X$; the choice of any
Turing-equivalent programming language does not affect $K(X)$ by more than a constant term. By a
simple diagonalization, Kolmogorov complexity is neither computable nor even approximable, so it
is most often used for proving lower bounds. In our proof, we use three properties of Kolmogorov
complexity: the Kolmogorov complexity of any fixed, finite object is a constant; if an object can
be computed from other objects, then its Kolmogorov complexity is at most the sum of theirs plus a
constant; and, in any set of size $k$, nearly all elements have Kolmogorov complexity
$\Omega(\log k)$.
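
The third property follows from a standard counting argument, sketched here for a set $X$ of size $k$ and an arbitrary constant $c \geq 0$ (both symbols are used only in this note): there are too few short programs to describe more than a small fraction of $X$.

```latex
\begin{align*}
% Programs are binary strings; count those shorter than log k - c bits
\bigl|\{\, p \in \{0,1\}^* : |p| < \log k - c \,\}\bigr| &< 2^{\log k - c} = k / 2^c , \\
% Each program returns at most one object, so few elements of X are that compressible
\bigl|\{\, x \in X : K(x) < \log k - c \,\}\bigr| &< k / 2^c , \\
% Hence all but a 2^{-c} fraction of X has complexity at least log k - c
\frac{\bigl|\{\, x \in X : K(x) \geq \log k - c \,\}\bigr|}{k} &> 1 - 2^{-c} .
\end{align*}
```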

Theorem 2. For $f(n, \ell) \in o(n^\ell \log n)$, there is no $f(n, \ell)$-footprint compressor.

Proof. Assume there is an $f(n, \ell)$-footprint compressor $A$. As a one-pass algorithm, $A$'s
future behaviour is determined only by its $f(n, \ell)$-bit memory and its unread input; thus, the
function $A(n, \ell, \cdot)$ is computed by some $2^{f(n,\ell)}$-state transducer. Let
$T_{n,\ell}$ be the lexicographically first such transducer and let $\delta^*$ be its extended
transition relation. We can construct $T_{n,\ell}$ from $A$, $n$ and $\ell$, so
$K(T_{n,\ell}) \in o(n^\ell \log n)$.

Let $S_{n,\ell}$ be the first $n^\ell$ characters of any $n$-ary linear de Bruijn sequence of
order $\ell$. An $n$-ary linear de Bruijn sequence of order $\ell$ is an $n$-ary string containing
each possible $\ell$-tuple exactly once; it has length $n^\ell + \ell - 1$ and its first and last
$\ell - 1$ characters are the same. Notice $H_\ell(S_{n,\ell}^k) = 0$ for any positive $k$. Let
$B$ be the shortest binary string such that there are reachable states $q_1$ and $q_2$ in
$T_{n,\ell}$ with $(q_1, S_{n,\ell}, B, q_2) \in \delta^*$; i.e., if $T_{n,\ell}$ is in state
$q_1$ and reads $S_{n,\ell}$, then it prints $B$ and stops in $q_2$. By the definition of a
compressor, $|A(n, \ell, S_{n,\ell}^k)| \in o(|S_{n,\ell}^k| \log n)$ once $k$ is large enough for
the additive $O(n^{\ell+1} \log n)$ term to be negligible, so, averaging over the $k$ copies,
$|B| \in o(n^\ell \log n)$.
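
The properties of de Bruijn sequences used here are easy to check on small cases. The sketch below builds one cyclic sequence with the standard recursive construction from Lyndon words, linearizes it, and verifies that every $\ell$-tuple occurs exactly once and that, in repeated copies of the first $n^\ell$ characters, each context of length $\ell$ has a single possible successor (which is what makes the $\ell$th-order empirical entropy 0). The function names are chosen here for illustration and the construction yields only one of the many sequences the proof ranges over.

```python
from itertools import product


def de_bruijn_cycle(n, ell):
    """One cyclic n-ary de Bruijn sequence of order ell, as a list of ints."""
    a = [0] * (ell + 1)
    seq = []

    def db(t, p):
        if t > ell:
            if ell % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq


def linear_de_bruijn(n, ell):
    """Linear version: the cycle followed by a copy of its first ell - 1 characters."""
    cyc = de_bruijn_cycle(n, ell)
    return cyc + cyc[:ell - 1]


def followers(s, ell):
    """Map each length-ell context in s to the set of characters that follow it."""
    f = {}
    for i in range(ell, len(s)):
        f.setdefault(tuple(s[i - ell:i]), set()).add(s[i])
    return f


n, ell, k = 3, 2, 4
lin = linear_de_bruijn(n, ell)
windows = [tuple(lin[i:i + ell]) for i in range(len(lin) - ell + 1)]
assert len(lin) == n ** ell + ell - 1
assert sorted(windows) == sorted(product(range(n), repeat=ell))  # each tuple once
s = lin[:n ** ell]                                               # first n^ell characters
assert all(len(v) == 1 for v in followers(s * k, ell).values())  # contexts are deterministic
print(lin)
```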

There cannot be another string $S'_{n,\ell}$ such that
$(q_1, S'_{n,\ell}, B, q_2) \in \delta^*$; otherwise, for some $n$-ary prefix $P$, we would have
$A(n, \ell, P S'_{n,\ell}) = A(n, \ell, P S_{n,\ell})$ and $A$ would not be lossless. Thus, we can
construct $S_{n,\ell}$ from $T_{n,\ell}$, $q_1$, $q_2$ and $B$, so
$K(S_{n,\ell}) \leq 2 \log |T_{n,\ell}| + o(n^\ell \log n) = 2 f(n, \ell) + o(n^\ell \log n)$. On
the other hand, there are $(n!)^{n^{\ell-1}}$ $n$-ary linear de Bruijn sequences of order
$\ell$ [16], so $K(S_{n,\ell}) \in \Omega(n^\ell \log n)$ for nearly all choices of $S_{n,\ell}$.
Therefore, $f(n, \ell) \in \Omega(n^\ell \log n)$. □
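
The final step of the proof combines the count of de Bruijn sequences with the upper bound on $K(S_{n,\ell})$. The chain below is a sketch that uses the elementary bound $n! \geq (n/e)^n$; it treats distinct linear de Bruijn sequences as having distinct prefixes $S_{n,\ell}$, which holds because the last $\ell - 1$ characters repeat the first $\ell - 1$.

```latex
\begin{align*}
% Logarithm of the number of candidate strings S_{n,l}
\log\, (n!)^{n^{\ell-1}} &= n^{\ell-1} \log n!
  \;\geq\; n^{\ell-1} \cdot n \log\frac{n}{e}
  \;=\; n^{\ell} (\log n - \log e) \in \Omega(n^\ell \log n) , \\
% Upper bound from the transducer, lower bound from counting
\Omega(n^\ell \log n) \ni K(S_{n,\ell}) &\leq 2 f(n, \ell) + o(n^\ell \log n) , \\
% Rearranging contradicts f(n,l) in o(n^l log n)
f(n, \ell) &\geq \tfrac{1}{2} \bigl( K(S_{n,\ell}) - o(n^\ell \log n) \bigr) \in \Omega(n^\ell \log n) .
\end{align*}
```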